Abstract. We give polynomial-time algorithms for computing the values of
Markov decision processes (MDPs) with limsup and liminf objectives. A
real-valued reward is assigned to each state, and the value of an infinite
path in the MDP is the limsup (resp. liminf) of all rewards along the path.
The value of an MDP is the maximal expected value of an infinite path that
can be achieved by resolving the decisions of the MDP. Using our result on
MDPs, we show that turn-based stochastic games with limsup and liminf
objectives can be solved in NP ∩ coNP.

1 Introduction 

A turn-based stochastic game is played on a finite graph with three types of
states: in player-1 states, the first player chooses a successor state from a
given set of outgoing edges; in player-2 states, the second player chooses a
successor state from a given set of outgoing edges; and in probabilistic
states, the successor state is chosen according to a given probability
distribution. The game results in an infinite path through the graph. Every
such path is assigned a real value, and the objective of player 1 is to resolve
her choices so as to maximize the expected value of the resulting path, while
the objective of player 2 is to minimize the expected value. If the function
that assigns values to infinite paths is a Borel function (in the Cantor
topology on infinite paths), then the game is determined [12]: the maximal
expected value achievable by player 1 is equal to the minimal expected value
achievable by player 2, and it is called the value of the game.

There are several canonical functions for assigning values to infinite paths. 
If each state is given a reward, then the max (resp. min) functions choose the 
maximum (resp. minimum) of the infinitely many rewards along a path; the 
limsup (resp. liminf) functions choose the limsup (resp. liminf) of the infinitely 
many rewards; and the limavg function chooses the long-run average of the 
rewards. For the Borel level-1 functions max and min, as well as for the Borel
level-3 function limavg, computing the value of a game is known to be in
NP ∩ coNP [10]. However, for the Borel level-2 functions limsup and liminf,
only special cases have been considered so far. If there are no probabilistic
states (in this case, the game is called deterministic), then the game value
can be computed in polynomial time using value-iteration algorithms [1];
likewise, if all states are given reward 0 or 1 (in this case, limsup is a
Büchi objective, and liminf is a coBüchi objective), then the game value can be
decided in NP ∩ coNP [3]. In this paper, we show that the values of general
turn-based stochastic games with limsup and liminf objectives can be computed
in NP ∩ coNP.

It is known that pure memoryless strategies suffice for achieving the value of
turn-based stochastic games with limsup and liminf objectives [9]. A strategy
is pure if the player always chooses a unique successor state (rather than a
probability distribution of successor states); a pure strategy is memoryless if
at every state, the player always chooses the same successor state. Hence a
pure memoryless strategy for player 1 is a function from player-1 states to
outgoing edges (and similarly for player 2). Since pure memoryless strategies
offer polynomial witnesses, our result will follow from polynomial-time
algorithms for computing the values of Markov decision processes (MDPs) with
limsup and liminf objectives. We provide such algorithms.

An MDP is the special case of a turn-based stochastic game which contains no
player-1 (or player-2) states. Using algorithms for solving MDPs with Büchi and
coBüchi objectives, we give polynomial-time reductions from MDPs with limsup
and liminf objectives to MDPs with max objectives. The solution of MDPs with
max objectives is computable by linear programming, and the linear program
for MDPs with max objectives is obtained by generalizing the linear program
for MDPs with reachability objectives. This will conclude our argument.

Related work. Games with limsup and liminf objectives have been widely
studied in game theory; for example, Maitra and Sudderth [11] present several
results about games with limsup and liminf objectives. In particular, they show
the existence of values in limsup and liminf games that are more general than
turn-based stochastic games, such as concurrent games, where the two players
repeatedly choose their moves simultaneously and independently, and games with
infinite state spaces. Gimbert and Zielonka have studied the strategy
complexity of games with limsup and liminf objectives: the sufficiency of pure
memoryless strategies for deterministic games was shown in [8], and for
turn-based stochastic games, in [9]. Polynomial-time algorithms for MDPs with
Büchi and coBüchi objectives were presented in [5], and the solution of
turn-based stochastic games with Büchi and coBüchi objectives was shown to be
in NP ∩ coNP in [3]. For deterministic games with limsup and liminf objectives,
polynomial-time algorithms are known; for example, the value-iteration
algorithm terminates in polynomial time [1].

2 Definitions 

We consider the class of turn-based probabilistic games and some of its sub- 
classes. 

Game graphs. A turn-based probabilistic game graph (2½-player game graph)
$G = ((S, E), (S_1, S_2, S_P), \delta)$ consists of a directed graph $(S, E)$, a
partition $(S_1, S_2, S_P)$ of the finite set $S$ of states, and a probabilistic
transition function $\delta: S_P \to \mathcal{D}(S)$, where $\mathcal{D}(S)$
denotes the set of probability distributions over the state space $S$. The
states in $S_1$ are the player-1 states, where player 1 decides the successor
state; the states in $S_2$ are the player-2 states, where player 2 decides the
successor state; and the states in $S_P$ are the probabilistic states, where
the successor state is chosen according to the probabilistic transition
function $\delta$. We assume that for $s \in S_P$ and $t \in S$, we have
$(s, t) \in E$ iff $\delta(s)(t) > 0$, and we often write $\delta(s, t)$ for
$\delta(s)(t)$. For technical convenience we assume that every state in the
graph $(S, E)$ has at least one outgoing edge. For a state $s \in S$, we write
$E(s)$ to denote the set $\{ t \in S \mid (s, t) \in E \}$ of possible
successors. The turn-based deterministic game graphs (2-player game graphs) are
the special case of the 2½-player game graphs with $S_P = \emptyset$. The Markov
decision processes (1½-player game graphs) are the special case of the
2½-player game graphs with $S_1 = \emptyset$ or $S_2 = \emptyset$. We refer to
the MDPs with $S_2 = \emptyset$ as player-1 MDPs, and to the MDPs with
$S_1 = \emptyset$ as player-2 MDPs.

Plays and strategies. An infinite path, or a play, of the game graph $G$ is an
infinite sequence $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ of states such
that $(s_k, s_{k+1}) \in E$ for all $k \in \mathbb{N}$. We write $\Omega$ for
the set of all plays, and for a state $s \in S$, we write
$\Omega_s \subseteq \Omega$ for the set of plays that start from the state $s$.
A strategy for player 1 is a function
$\sigma: S^* \cdot S_1 \to \mathcal{D}(S)$ that assigns a probability
distribution to all finite sequences $w \in S^* \cdot S_1$ of states ending in a
player-1 state (the sequence represents a prefix of a play). Player 1 follows
the strategy $\sigma$ if in each player-1 move, given that the current history
of the game is $w \in S^* \cdot S_1$, she chooses the next state according to
the probability distribution $\sigma(w)$. A strategy must prescribe only
available moves, i.e., for all $w \in S^*$, $s \in S_1$, and $t \in S$, if
$\sigma(w \cdot s)(t) > 0$, then $(s, t) \in E$. The strategies for player 2
are defined analogously. We denote by $\Sigma$ and $\Pi$ the set of all
strategies for player 1 and player 2, respectively.

Once a starting state $s \in S$ and strategies $\sigma \in \Sigma$ and
$\pi \in \Pi$ for the two players are fixed, the outcome of the game is a
random walk $\omega_s^{\sigma,\pi}$ for which the probabilities of events are
uniquely defined, where an event $A \subseteq \Omega$ is a measurable set of
plays. For a state $s \in S$ and an event $A \subseteq \Omega$, we write
$\Pr_s^{\sigma,\pi}(A)$ for the probability that a play belongs to $A$ if the
game starts from the state $s$ and the players follow the strategies $\sigma$
and $\pi$, respectively. For a measurable function
$f: \Omega \to \mathbb{R}$ we denote by $\mathbb{E}_s^{\sigma,\pi}[f]$ the
expectation of the function $f$ under the probability measure
$\Pr_s^{\sigma,\pi}(\cdot)$.

Strategies that do not use randomization are called pure. A player-1 strategy
$\sigma$ is pure if for all $w \in S^*$ and $s \in S_1$, there is a state
$t \in S$ such that $\sigma(w \cdot s)(t) = 1$. A memoryless player-1 strategy
does not depend on the history of the play but only on the current state; i.e.,
for all $w, w' \in S^*$ and for all $s \in S_1$ we have
$\sigma(w \cdot s) = \sigma(w' \cdot s)$. A memoryless strategy can be
represented as a function $\sigma: S_1 \to \mathcal{D}(S)$. A pure memoryless
strategy is a strategy that is both pure and memoryless. A pure memoryless
strategy for player 1 can be represented as a function $\sigma: S_1 \to S$. We
denote by $\Sigma^{PM}$ the set of pure memoryless strategies for player 1. The
pure memoryless player-2 strategies $\Pi^{PM}$ are defined analogously.



Given a pure memoryless strategy $\sigma \in \Sigma^{PM}$, let $G_\sigma$ be
the game graph obtained from $G$ under the constraint that player 1 follows the
strategy $\sigma$. The corresponding definition $G_\pi$ for a player-2 strategy
$\pi \in \Pi^{PM}$ is analogous, and we write $G_{\sigma,\pi}$ for the game
graph obtained from $G$ if both players follow the pure memoryless strategies
$\sigma$ and $\pi$, respectively. Observe that given a 2½-player game graph $G$
and a pure memoryless player-1 strategy $\sigma$, the result $G_\sigma$ is a
player-2 MDP. Similarly, for a player-1 MDP $G$ and a pure memoryless player-1
strategy $\sigma$, the result $G_\sigma$ is a Markov chain. Hence, if $G$ is a
2½-player game graph and the two players follow pure memoryless strategies
$\sigma$ and $\pi$, the result $G_{\sigma,\pi}$ is a Markov chain.

Quantitative objectives. A quantitative objective is specified as a measurable
function $f: \Omega \to \mathbb{R}$. We consider zero-sum games, i.e., games
that are strictly competitive. In zero-sum games the objectives of the players
are functions $f$ and $-f$, respectively. We consider quantitative objectives
specified as limsup and liminf objectives. These objectives are complete for
the second level of the Borel hierarchy: limsup objectives are $\Pi_2$
complete, and liminf objectives are $\Sigma_2$ complete. The definitions of
limsup and liminf objectives are as follows.

- Limsup objectives. Let $r: S \to \mathbb{R}$ be a real-valued reward function
  that assigns to every state $s$ the reward $r(s)$. The limsup objective
  assigns to every play the maximum reward that appears infinitely often in the
  play. Formally, for a play $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ we
  have

  $$\mathrm{limsup}(r)(\omega) = \limsup \langle r(s_i) \rangle_{i \ge 0}.$$

- Liminf objectives. Let $r: S \to \mathbb{R}$ be a real-valued reward function
  that assigns to every state $s$ the reward $r(s)$. The liminf objective
  assigns to every play the maximum reward $v$ such that the rewards that
  appear eventually always in the play are at least $v$. Formally, for a play
  $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ we have

  $$\mathrm{liminf}(r)(\omega) = \liminf \langle r(s_i) \rangle_{i \ge 0}.$$

The objectives limsup and liminf are complementary in the sense that for all
plays $\omega$ we have
$\mathrm{limsup}(r)(\omega) = -\mathrm{liminf}(-r)(\omega)$.

We also define the max objectives, as they will be useful in the study of MDPs
with limsup and liminf objectives. Later we will reduce MDPs with limsup and
liminf objectives to MDPs with max objectives. For a reward function
$r: S \to \mathbb{R}$ the max objective assigns to every play the maximum
reward that appears in the play. Observe that since $S$ is finite, the number
of different rewards appearing in a play is finite and hence the maximum is
defined. Formally, for a play
$\omega = \langle s_0, s_1, s_2, \ldots \rangle$ we have

$$\max(r)(\omega) = \max \langle r(s_i) \rangle_{i \ge 0}.$$
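The three objectives can be illustrated on an ultimately periodic play, that is, a finite prefix followed by a cycle repeated forever. The following is a minimal sketch under that assumption; the function name `lasso_objectives` and the dictionary representation of the reward function are hypothetical choices made for the example, not notation from the paper. On such a play the states visited infinitely often are exactly the cycle states, so limsup and liminf depend only on the cycle, while max also sees the prefix.

```python
def lasso_objectives(prefix, cycle, reward):
    """Evaluate limsup, liminf and max on the play prefix . cycle^omega.

    prefix, cycle: lists of states; reward: dict mapping state -> real reward.
    (Hypothetical helper for illustration only.)
    """
    if not cycle:
        raise ValueError("the repeated part of the play must be non-empty")
    cycle_rewards = [reward[s] for s in cycle]
    all_rewards = [reward[s] for s in prefix] + cycle_rewards
    return {
        "limsup": max(cycle_rewards),  # largest reward seen infinitely often
        "liminf": min(cycle_rewards),  # smallest reward seen infinitely often
        "max": max(all_rewards),       # largest reward seen at all
    }


reward = {"a": 5, "b": 1, "c": 3}
values = lasso_objectives(["a"], ["b", "c"], reward)

# Complementarity: limsup(r) = -liminf(-r) on the same play.
negated = {s: -x for s, x in reward.items()}
negated_values = lasso_objectives(["a"], ["b", "c"], negated)
```

In this example the reward 5 appears only once, so it contributes to max but not to limsup.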

Büchi and coBüchi objectives. We define the qualitative variants of limsup
and liminf objectives, namely, Büchi and coBüchi objectives. The notion of
qualitative variants of the objectives will be useful in the algorithmic
analysis of 2½-player games with limsup and liminf objectives. For a play
$\omega$, we define
$\mathrm{Inf}(\omega) = \{ s \in S \mid s_k = s \text{ for infinitely many } k \ge 0 \}$
to be the set of states that occur infinitely often in $\omega$.

- Büchi objectives. Given a set $B \subseteq S$ of Büchi states, the Büchi
  objective $\mathrm{Büchi}(B)$ requires that some state in $B$ be visited
  infinitely often. The set of winning plays is
  $\mathrm{Büchi}(B) = \{ \omega \in \Omega \mid \mathrm{Inf}(\omega) \cap B \neq \emptyset \}$.

- coBüchi objectives. Given a set $C \subseteq S$ of coBüchi states, the
  coBüchi objective $\mathrm{coBüchi}(C)$ requires that only states in $C$ be
  visited infinitely often. Thus, the set of winning plays is
  $\mathrm{coBüchi}(C) = \{ \omega \in \Omega \mid \mathrm{Inf}(\omega) \subseteq C \}$.
  The Büchi and coBüchi objectives are dual in the sense that
  $\mathrm{Büchi}(B) = \Omega \setminus \mathrm{coBüchi}(S \setminus B)$.

Given a set $B \subseteq S$, consider a boolean reward function $r_B$ such that
for all $s \in S$ we have $r_B(s) = 1$ if $s \in B$, and $r_B(s) = 0$
otherwise. Then for all plays $\omega$ we have $\omega \in \mathrm{Büchi}(B)$
iff $\mathrm{limsup}(r_B)(\omega) = 1$. Similarly, given a set
$C \subseteq S$, consider a boolean reward function $r_C$ such that for all
$s \in S$ we have $r_C(s) = 1$ if $s \in C$, and $r_C(s) = 0$ otherwise. Then
for all plays $\omega$ we have $\omega \in \mathrm{coBüchi}(C)$ iff
$\mathrm{liminf}(r_C)(\omega) = 1$.

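Values and optimal strategies. Given a game graph $G$, qualitative objectives
$\Phi \subseteq \Omega$ for player 1 and $\Omega \setminus \Phi$ for player 2,
and measurable functions $f$ and $-f$ for player 1 and player 2, respectively,
we define the value functions $\langle\langle 1 \rangle\rangle_{val}$ and
$\langle\langle 2 \rangle\rangle_{val}$ for the players 1 and 2, respectively,
as the following functions from the state space $S$ to the set $\mathbb{R}$ of
reals: for all states $s \in S$, let

$$\langle\langle 1 \rangle\rangle_{val}^{G}(\Phi)(s) = \sup_{\sigma \in \Sigma} \inf_{\pi \in \Pi} \Pr\nolimits_s^{\sigma,\pi}(\Phi);$$

$$\langle\langle 1 \rangle\rangle_{val}^{G}(f)(s) = \sup_{\sigma \in \Sigma} \inf_{\pi \in \Pi} \mathbb{E}_s^{\sigma,\pi}[f];$$

$$\langle\langle 2 \rangle\rangle_{val}^{G}(\Omega \setminus \Phi)(s) = \sup_{\pi \in \Pi} \inf_{\sigma \in \Sigma} \Pr\nolimits_s^{\sigma,\pi}(\Omega \setminus \Phi);$$

$$\langle\langle 2 \rangle\rangle_{val}^{G}(-f)(s) = \sup_{\pi \in \Pi} \inf_{\sigma \in \Sigma} \mathbb{E}_s^{\sigma,\pi}[-f].$$

In other words, the values $\langle\langle 1 \rangle\rangle_{val}^{G}(\Phi)(s)$
and $\langle\langle 1 \rangle\rangle_{val}^{G}(f)(s)$ give the maximal
probability and expectation with which player 1 can achieve her objectives
$\Phi$ and $f$ from state $s$, and analogously for player 2. The strategies
that achieve the values are called optimal: a strategy $\sigma$ for player 1 is
optimal from the state $s$ for the objective $\Phi$ if
$\langle\langle 1 \rangle\rangle_{val}^{G}(\Phi)(s) = \inf_{\pi \in \Pi} \Pr_s^{\sigma,\pi}(\Phi)$;
and $\sigma$ is optimal from the state $s$ for $f$ if
$\langle\langle 1 \rangle\rangle_{val}^{G}(f)(s) = \inf_{\pi \in \Pi} \mathbb{E}_s^{\sigma,\pi}[f]$.
The optimal strategies for player 2 are defined analogously. We now state the
classical determinacy results for 2½-player games with limsup and liminf
objectives.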



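Theorem 1 (Quantitative determinacy). For all 2½-player game graphs
$G = ((S, E), (S_1, S_2, S_P), \delta)$, the following assertions hold.

1. For all reward functions $r: S \to \mathbb{R}$ and all states $s \in S$, we
   have

   $$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) + \langle\langle 2 \rangle\rangle_{val}^{G}(\mathrm{liminf}(-r))(s) = 0;$$

   $$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))(s) + \langle\langle 2 \rangle\rangle_{val}^{G}(\mathrm{limsup}(-r))(s) = 0.$$

2. Pure memoryless optimal strategies exist for both players from all states.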

The above results can be derived from the results in [11]; a more direct proof
can be obtained as follows: the existence of pure memoryless optimal strategies
for MDPs with limsup and liminf objectives can be proved by extending the
results known for Büchi and coBüchi objectives. Theorem 3.19 of [7] shows that
if for a quantitative objective $f$ and its complement $-f$ pure memoryless
optimal strategies exist in MDPs, then pure memoryless optimal strategies also
exist in 2½-player games. Hence the pure memoryless determinacy follows for
2½-player games with limsup and liminf objectives.

3 The Complexity of 2½-Player Games with
Limsup and Liminf Objectives

In this section we study the complexity of MDPs and 2½-player games with
limsup and liminf objectives. In the next subsections we present
polynomial-time algorithms for MDPs with limsup and liminf objectives by
reductions to a simple linear-programming formulation, and then show that
2½-player games can be decided in NP ∩ coNP. We first present a remark and
then recall some basic results on MDPs.

Remark 1. Given a 2½-player game graph $G$ with a reward function
$r: S \to \mathbb{R}$ and a real constant $c$, consider the reward function
$(r + c): S \to \mathbb{R}$ defined as follows: for $s \in S$ we have
$(r + c)(s) = r(s) + c$. Then the following assertions hold: for all $s \in S$

$$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r + c))(s) = \langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) + c;$$

$$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r + c))(s) = \langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))(s) + c.$$

Hence we can shift a reward function $r$ by a real constant $c$, and from the
value function for the reward function $(r + c)$, we can easily compute the
value function for $r$. Hence, without loss of generality, for computational
purposes we assume that the reward function has positive rewards, i.e.,
$r: S \to \mathbb{R}^+$, where $\mathbb{R}^+$ is the set of positive reals.

3.1 Basic results on MDPs 

In this section we recall several basic properties of MDPs. We start with the
definition of end components in MDPs [5, 4], which play a role equivalent to
closed recurrent sets in Markov chains.



End components. Given an MDP $G = ((S, E), (S_1, S_P), \delta)$, a set
$U \subseteq S$ of states is an end component if $U$ is $\delta$-closed (i.e.,
for all $s \in U \cap S_P$ we have $E(s) \subseteq U$) and the sub-game graph
of $G$ restricted to $U$ (denoted $G \upharpoonright U$) is strongly connected.
We denote by $\mathcal{E}(G)$ the set of end components of an MDP $G$. The
following lemma states that, given any strategy (memoryless or not), with
probability 1 the set of states visited infinitely often along a play is an end
component. This lemma allows us to derive conclusions on the (infinite) set of
plays in an MDP by analyzing the (finite) set of end components in the MDP.
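The two conditions in the definition of an end component can be checked directly on a small MDP. The sketch below is illustrative only: the function name `is_end_component`, the successor-dictionary encoding of the edge relation, and the extra requirement that every state of the candidate set keeps at least one successor inside the set (so that the restriction is itself a game graph) are assumptions made for this example.

```python
def _reachable(start, edges):
    """States reachable from `start` along `edges` (dict: state -> list)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for t in edges[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def is_end_component(u, sp, succ):
    """Check whether `u` is delta-closed and strongly connected inside `u`.

    u: candidate set of states; sp: probabilistic states;
    succ: dict mapping every state to the list E(s) of its successors.
    """
    u = set(u)
    if not u:
        return False
    # delta-closed: probabilistic states in U have all successors in U.
    if any(not set(succ[s]) <= u for s in u & set(sp)):
        return False
    # Edges of the sub-game graph restricted to U, forwards and backwards.
    fwd = {s: [t for t in succ[s] if t in u] for s in u}
    if any(not ts for ts in fwd.values()):  # assumed: no dead ends inside U
        return False
    bwd = {s: [] for s in u}
    for s, ts in fwd.items():
        for t in ts:
            bwd[t].append(s)
    # Strongly connected: one root reaches all of U and is reached from all of U.
    root = next(iter(u))
    return _reachable(root, fwd) == u and _reachable(root, bwd) == u


# Hypothetical MDP: a, b are player-1 states; p, q are probabilistic states.
sp = {"p", "q"}
succ = {"a": ["p"], "p": ["a", "b"], "b": ["q"], "q": ["b"]}
```

Here `{"b", "q"}` is an end component, while `{"a", "p"}` is not, because the probabilistic state `p` can leave the set.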

Lemma 1. [5, 4] Given an MDP $G$, for all states $s \in S$ and all strategies
$\sigma \in \Sigma$, we have
$\Pr_s^{\sigma}(\{ \omega \mid \mathrm{Inf}(\omega) \in \mathcal{E}(G) \}) = 1$.

For an end component $U \in \mathcal{E}(G)$, consider the memoryless strategy
$\sigma_U$ that at a state $s$ in $U \cap S_1$ plays all edges in
$E(s) \cap U$ uniformly at random. Given the strategy $\sigma_U$, the end
component $U$ is a closed connected recurrent set in the Markov chain obtained
by fixing $\sigma_U$.

Lemma 2. Given an MDP $G$ and an end component $U \in \mathcal{E}(G)$, the
strategy $\sigma_U$ ensures that for all states $s \in U$, we have
$\Pr_s^{\sigma_U}(\{ \omega \mid \mathrm{Inf}(\omega) = U \}) = 1$.

Almost-sure winning states. Given an MDP $G$ with a Büchi or a coBüchi
objective $\Phi$ for player 1, we denote by

$$W_1^G(\Phi) = \{ s \in S \mid \langle\langle 1 \rangle\rangle_{val}^{G}(\Phi)(s) = 1 \}$$

the set of states at which the value for player 1 is 1. These states are also
referred to as the almost-sure winning states for the player, and an optimal
strategy from the almost-sure winning states is referred to as an almost-sure
winning strategy. The set $W_1^G(\Phi)$, for Büchi or coBüchi objectives
$\Phi$, for an MDP $G$ can be computed in time polynomial in $n$, where $n$ is
the size of the MDP $G$ [2].

Attractor of probabilistic states. We define a notion of attractor of
probabilistic states: given an MDP $G$ and a set $U \subseteq S$ of states, we
denote by $\mathrm{Attr}_P(U, G)$ the set of states from where the
probabilistic player has a strategy (with proper choice of edges) to force the
game to reach $U$. The set $\mathrm{Attr}_P(U, G)$ is inductively defined as
follows:

$$T_0 = U; \qquad T_{i+1} = T_i \cup \{ s \in S_P \mid E(s) \cap T_i \neq \emptyset \} \cup \{ s \in S_1 \mid E(s) \subseteq T_i \}$$

and $\mathrm{Attr}_P(U, G) = \bigcup_{i \ge 0} T_i$.
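The inductive definition of the attractor translates into a simple fixpoint loop. The following sketch is only an illustration of that definition: the name `attractor`, the set-based encoding of $S_1$ and $S_P$, and the successor dictionary are assumptions for the example. A probabilistic state joins as soon as one of its edges leads into the current set; a player-1 state joins only when all of its edges do.

```python
def attractor(target, s1, sp, succ):
    """Compute Attr_P(target, G) for an MDP given by s1, sp and succ.

    s1: player-1 states; sp: probabilistic states;
    succ: dict mapping every state to the list E(s) of its successors.
    """
    t = set(target)
    changed = True
    while changed:                      # iterate T_0, T_1, ... to a fixpoint
        changed = False
        for s in (set(s1) | set(sp)) - t:
            if s in sp:
                joins = any(u in t for u in succ[s])   # E(s) meets T_i
            else:
                joins = all(u in t for u in succ[s])   # E(s) inside T_i
            if joins:
                t.add(s)
                changed = True
    return t


# Hypothetical MDP: a, b are player-1 states; p, q are probabilistic states.
s1 = {"a", "b"}
sp = {"p", "q"}
succ = {"a": ["p"], "p": ["a", "b"], "b": ["q"], "q": ["b"]}

attr_bq = attractor({"b", "q"}, s1, sp, succ)
attr_a = attractor({"a"}, s1, sp, succ)
```

In this example `p` has an edge to `b`, so it is attracted to `{"b", "q"}`, and then `a`, whose only edge leads to `p`, is attracted as well.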

We now present a lemma about MDPs with Büchi and coBüchi objectives and a
property of end components and attractors. The first two properties of
Lemma 3 follow from Lemma 2. The last property follows from the fact that an
end component is $\delta$-closed (i.e., for an end component $U$, for all
$s \in U \cap S_P$ we have $E(s) \subseteq U$).

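Lemma 3. Let $G$ be an MDP. Given $B \subseteq S$ and $C \subseteq S$, the
following assertions hold.

1. For all $U \in \mathcal{E}(G)$ such that $U \cap B \neq \emptyset$, we have
   $U \subseteq W_1^G(\mathrm{Büchi}(B))$.

2. For all $U \in \mathcal{E}(G)$ such that $U \subseteq C$, we have
   $U \subseteq W_1^G(\mathrm{coBüchi}(C))$.

3. For all $Y \subseteq S$ and all end components $U \in \mathcal{E}(G)$, if
   $X = \mathrm{Attr}_P(Y, G)$, then either (a) $U \cap Y \neq \emptyset$ or
   (b) $U \cap X = \emptyset$.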

3.2 MDPs with limsup objectives 

In this subsection we present a polynomial-time algorithm for MDPs with limsup
objectives. For the sake of simplicity we will consider bipartite MDPs.

Bipartite MDPs. An MDP $G = ((S, E), (S_1, S_P), \delta)$ is bipartite if
$E \subseteq S_1 \times S_P \cup S_P \times S_1$. An MDP $G$ can be converted
into a bipartite MDP $G'$ by adding dummy states with a unique successor, and
$G'$ is linear in the size of $G$. In the sequel, without loss of generality,
we will consider bipartite MDPs. The key property of bipartite MDPs that will
be useful is as follows: for a bipartite MDP
$G = ((S, E), (S_1, S_P), \delta)$, for all $U \in \mathcal{E}(G)$ we have
$U \cap S_1 \neq \emptyset$.

Informal description of algorithm. We first present an algorithm that takes
an MDP $G$ with a positive reward function $r: S \to \mathbb{R}^+$, and
computes a set $S^*$ and a function $f^*: S^* \to \mathbb{R}^+$. The output of
the algorithm will be useful in the reduction of MDPs with limsup objectives to
MDPs with max objectives. Let the rewards be $v_0 > v_1 > \cdots > v_k$. The
algorithm proceeds in iterations, and in iteration $i$ we denote the MDP as
$G_i$ and the state space as $S^i$. At iteration $i$ the algorithm considers
the set $V_i$ of states with reward $v_i$ in the MDP $G_i$, and computes the set
$U_i = W_1^{G_i}(\mathrm{Büchi}(V_i))$ (i.e., the almost-sure winning set in
the MDP $G_i$ for the Büchi objective with the Büchi set $V_i$). For all
$u \in U_i \cap S_1$ we assign $f^*(u) = v_i$ and add the set $U_i \cap S_1$ to
$S^*$. Then the set $\mathrm{Attr}_P(U_i, G_i)$ is removed from the MDP $G_i$
and we proceed to iteration $i + 1$. In $G_i$ all end components that intersect
with reward $v_i$ are contained in $U_i$ (by Lemma 3 part (1)), and all end
components in $S^i \setminus U_i$ do not intersect with
$\mathrm{Attr}_P(U_i, G_i)$ (by Lemma 3 part (3)). This gives us the following
lemma.

Lemma 4. Let $G$ be an MDP with a positive reward function
$r: S \to \mathbb{R}^+$. Let $f^*$ be the output of Algorithm 1. For all end
components $U \in \mathcal{E}(G)$ and all states $u \in U \cap S_1$, we have
$\max(r(U)) \le f^*(u)$.

Proof. Let $U^* = \bigcup_{i=0}^{k} U_i$ (as computed in Algorithm 1). Then it
follows from Lemma 3 that for all $A \in \mathcal{E}(G)$ we have
$A \cap U^* \neq \emptyset$. Consider $A \in \mathcal{E}(G)$ and let
$v_i = \max(r(A))$. Suppose for some $j < i$ we have
$A \cap U_j \neq \emptyset$. Then there is a strategy to ensure that $U_j$ is
reached with probability 1 from all states in $A$ and then play an almost-sure
winning strategy in $U_j$ to ensure
$\mathrm{Büchi}(r^{-1}(v_j) \cap S^j)$. Then $A \subseteq U_j$. Hence for all
$u \in A \cap S_1$ we have $f^*(u) = v_j > v_i$. If for all $j < i$ we have
$A \cap U_j = \emptyset$, then we show that $A \subseteq U_i$. The uniform
memoryless strategy $\sigma_A$ (as used in Lemma 2) in $G_i$ is a witness to
prove that $A \subseteq U_i$. In this case for all $u \in A \cap S_1$ we have
$f^*(u) = v_i = \max(r(A))$. The desired result follows. ∎



Algorithm 1 MDPLimSup

Input: MDP $G = ((S, E), (S_1, S_P), \delta)$, a positive reward function $r: S \to \mathbb{R}^+$.
Output: $S^* \subseteq S$ and $f^*: S^* \to \mathbb{R}^+$.

1. Let $r(S) = \{ v_0, v_1, \ldots, v_k \}$ with $v_0 > v_1 > \cdots > v_k$;

2. $G_0 := G$; $S^* := \emptyset$;

3. for $i := 0$ to $k$ do {

   3.1 $U_i := W_1^{G_i}(\mathrm{Büchi}(r^{-1}(v_i) \cap S^i))$;

   3.2 for all $u \in U_i \cap S_1$
       $f^*(u) := v_i$;

   3.3 $S^* := S^* \cup (U_i \cap S_1)$;

   3.4 $B_i := \mathrm{Attr}_P(U_i, G_i)$;

   3.5 $G_{i+1} := G_i \setminus B_i$, $S^{i+1} := S^i \setminus B_i$;

   }

4. return $S^*$ and $f^*$.
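The loop of Algorithm 1 can be sketched in executable form as follows. This is a minimal illustration under several assumptions, not the implementation analyzed in the paper: the almost-sure Büchi winning set is computed by a simple textbook fixpoint (repeatedly discard states that cannot reach the target inside the current set, together with their attractor) rather than by the faster procedure of [2], and all function names and the dictionary encoding are hypothetical. Only the support of the transition function matters here, so probabilities are omitted.

```python
def attractor(target, states, s1, succ):
    """Attr_P(target) inside the sub-MDP induced by `states`."""
    t = set(target)
    changed = True
    while changed:
        changed = False
        for s in states - t:
            nxt = [u for u in succ[s] if u in states]
            if s in s1:   # player-1 state: all remaining edges lead into t
                joins = bool(nxt) and all(u in t for u in nxt)
            else:         # probabilistic state: some edge leads into t
                joins = any(u in t for u in nxt)
            if joins:
                t.add(s)
                changed = True
    return t


def almost_sure_buchi(b, states, s1, succ):
    """Almost-sure winning set for Buchi(b) in the sub-MDP on `states`."""
    w = set(states)
    while True:
        reach = set(b) & w          # backward reachability to b inside w
        frontier = list(reach)
        while frontier:
            t = frontier.pop()
            for s in w:
                if s not in reach and t in succ[s]:
                    reach.add(s)
                    frontier.append(s)
        bad = w - reach             # states that can never reach b again
        if not bad:
            return w
        w -= attractor(bad, w, s1, succ)


def mdp_limsup(s1, sp, succ, reward):
    """Return f* as a dict on S* (the keys), following Algorithm 1."""
    states = set(s1) | set(sp)      # S^0
    f_star = {}
    for v in sorted(set(reward.values()), reverse=True):   # v_0 > v_1 > ...
        if not states:
            break
        b = {s for s in states if reward[s] == v}
        u = almost_sure_buchi(b, states, s1, succ)          # step 3.1
        for s in u & set(s1):                               # steps 3.2, 3.3
            f_star[s] = v
        states -= attractor(u, states, s1, succ)            # steps 3.4, 3.5
    return f_star


# Hypothetical bipartite MDP with two disjoint loops a <-> p and b <-> q.
s1, sp = {"a", "b"}, {"p", "q"}
succ = {"a": ["p"], "p": ["a"], "b": ["q"], "q": ["b"]}
reward = {"a": 1, "p": 4, "b": 2, "q": 3}
f_star = mdp_limsup(s1, sp, succ, reward)
```

On this example the loop through `p` sees reward 4 infinitely often and the loop through `q` sees reward 3, which is what the computed `f_star` records for the player-1 states.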



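Transformation to MDPs with max objective. Consider an MDP
$G = ((S, E), (S_1, S_P), \delta)$ with a positive reward function
$r: S \to \mathbb{R}^+$, and let $S^*$ and $f^*$ be the output of Algorithm 1.
We construct an MDP
$\overline{G} = ((\overline{S}, \overline{E}), (\overline{S}_1, S_P), \overline{\delta})$
with a reward function $\overline{r}$ as follows:

- $\overline{S} = S \cup \widehat{S}^*$; i.e., the set of states consists of the
  state space $S$ and a copy $\widehat{S}^*$ of $S^*$.

- $\overline{E} = E \cup \{ (s, \widehat{s}) \mid s \in S^*, \widehat{s} \in \widehat{S}^* \text{ where } \widehat{s} \text{ is the copy of } s \} \cup \{ (\widehat{s}, \widehat{s}) \mid \widehat{s} \in \widehat{S}^* \}$;
  i.e., along with the edges $E$, for all states $s$ in $S^*$ there is an edge
  to its copy $\widehat{s}$ in $\widehat{S}^*$, and all states in
  $\widehat{S}^*$ are absorbing states.

- $\overline{S}_1 = S_1 \cup \widehat{S}^*$.

- $\overline{\delta} = \delta$.

- $\overline{r}(s) = 0$ for all $s \in S$ and $\overline{r}(\widehat{s}) = f^*(s)$
  for $\widehat{s} \in \widehat{S}^*$, where $\widehat{s}$ is the copy of $s$.

We refer to the above construction as limsup conversion. The following lemma
proves the relationship between the value function
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))$ and
$\langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))$.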

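Lemma 5. Let $G$ be an MDP with a positive reward function
$r: S \to \mathbb{R}^+$. Let $\overline{G}$ and $\overline{r}$ be obtained from
$G$ and $r$ by the limsup conversion. For all states $s \in S$, we have

$$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) = \langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s).$$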
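Proof. The result is obtained from the following two-case analysis.

1. Let $\sigma$ be a pure memoryless optimal strategy in $G$ for the objective
   $\mathrm{limsup}(r)$. Let $\mathcal{C} = \{ C_1, C_2, \ldots, C_m \}$ be the
   set of closed connected recurrent sets in the Markov chain obtained from $G$
   after fixing the strategy $\sigma$. Note that since we consider bipartite
   MDPs, for all $1 \le i \le m$, we have $C_i \cap S_1 \neq \emptyset$. Let
   $C = \bigcup_{i=1}^{m} C_i$. We define a pure memoryless strategy
   $\overline{\sigma}$ in $\overline{G}$ as follows:

   $$\overline{\sigma}(s) = \begin{cases} \sigma(s) & s \in S_1 \setminus C; \\ \widehat{s} & \widehat{s} \in \widehat{S}^* \text{ and } s \in S_1 \cap C. \end{cases}$$

   By Lemma 4 it follows that the strategy $\overline{\sigma}$ ensures that for
   all $C_i \in \mathcal{C}$ and all $s \in C_i$, the maximal reward reached in
   $\overline{G}$ is at least $\max(r(C_i))$ with probability 1. It follows
   that for all $s \in S$ we have

   $$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) \le \langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s).$$

2. Let $\overline{\sigma}$ be a pure memoryless optimal strategy for the
   objective $\max(\overline{r})$ in $\overline{G}$. We fix a strategy $\sigma$
   in $G$ as follows: if at a state $s \in S^*$ the strategy
   $\overline{\sigma}$ chooses the edge $(s, \widehat{s})$, then in $G$ on
   reaching $s$, the strategy $\sigma$ plays an almost-sure winning strategy
   for the objective $\mathrm{Büchi}(r^{-1}(f^*(s)))$; otherwise $\sigma$
   follows $\overline{\sigma}$. It follows that for all $s \in S$ we have

   $$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) \ge \langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s).$$

Thus we have the desired result. ∎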

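Linear programming for the max objective in $\overline{G}$. The following
linear program characterizes the value function
$\langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))$. For
all $s \in \overline{S}$ we have a variable $x_s$ and the objective function is
$\min \sum_{s \in \overline{S}} x_s$. The set of linear constraints is as
follows:

$$x_s \ge 0 \qquad \forall s \in \overline{S};$$

$$x_s = \overline{r}(s) \qquad \forall s \in \widehat{S}^*;$$

$$x_s \ge x_t \qquad \forall s \in \overline{S}_1,\ (s, t) \in \overline{E};$$

$$x_s = \sum_{t \in \overline{S}} \overline{\delta}(s)(t) \cdot x_t \qquad \forall s \in S_P.$$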
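To see the limsup conversion and the constraints above at work on a toy instance without an LP solver, the sketch below swaps in a different technique: plain value iteration from below, which approaches the least vector satisfying the constraints (the vector the linear program selects by minimizing the sum). This substitution, the function name `solve_converted_mdp`, the tuple encoding `("copy", s)` of the absorbing copies, and the stopping tolerance are all assumptions of this example and not part of the paper's method.

```python
def solve_converted_mdp(sp, succ, delta, f_star, sweeps=10_000, tol=1e-12):
    """Approximate the max-objective values in the converted MDP.

    sp: probabilistic states; succ: dict state -> list of successors in G;
    delta: dict probabilistic state -> {successor: probability};
    f_star: dict on S* as produced by Algorithm 1.
    Returns a dict with one value per original state of G.
    """
    # Limsup conversion: each s in S* gets an absorbing copy with reward f*(s);
    # every original state gets reward 0.
    copy = {s: ("copy", s) for s in f_star}
    succ_bar = {s: list(ts) for s, ts in succ.items()}
    x = {s: 0.0 for s in succ}
    for s, c in copy.items():
        succ_bar[s].append(c)      # edge (s, copy of s)
        succ_bar[c] = [c]          # the copy is absorbing
        x[c] = float(f_star[s])    # x_c = reward of the copy
    absorbing = set(copy.values())

    # Value iteration from below (assumed stand-in for solving the LP).
    for _ in range(sweeps):
        change = 0.0
        for s in succ_bar:
            if s in absorbing:
                continue
            if s in sp:            # x_s = sum_t delta(s)(t) * x_t
                new = sum(p * x[t] for t, p in delta[s].items())
            else:                  # x_s = max over outgoing edges
                new = max(x[t] for t in succ_bar[s])
            change = max(change, abs(new - x[s]))
            x[s] = new
        if change <= tol:
            break
    return {s: x[s] for s in succ}


# Hypothetical MDP: loops a <-> p and b <-> q, with f* taken as given.
sp = {"p", "q"}
succ = {"a": ["p"], "p": ["a"], "b": ["q"], "q": ["b"]}
delta = {"p": {"a": 1.0}, "q": {"b": 1.0}}
values = solve_converted_mdp(sp, succ, delta, {"a": 4.0, "b": 3.0})
```

For larger instances an actual LP solver applied to the constraints would be the route described in the text; the iteration here is only meant to make the fixed-point structure of the constraints concrete.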
The correctness proof of the above linear program to characterize the value
function $\langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))$
follows by extending the result for reachability objectives [6]. The key
property that can be used to prove the correctness of the above claim is as
follows: if a pure memoryless optimal strategy is fixed, then from all states
in $S$, the set $\widehat{S}^*$ of absorbing states is reached with
probability 1. The above property can be proved as follows: since $r$ is a
positive reward function, it follows that for all $s \in S$ we have
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) > 0$.
Moreover, for all states $s \in S$ we have
$\langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s) = \langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) > 0$.
Observe that for all $s \in S$ we have $\overline{r}(s) = 0$. Hence if we fix a
pure memoryless optimal strategy $\overline{\sigma}$ in $\overline{G}$, then in
the Markov chain $\overline{G}_{\overline{\sigma}}$ there is no closed
recurrent set $\overline{C}$ such that $\overline{C} \subseteq S$. It follows
that for all states $s \in S$, in the Markov chain
$\overline{G}_{\overline{\sigma}}$, the set $\widehat{S}^*$ is reached with
probability 1. Using the above fact and the correctness of linear programming
for reachability objectives, the correctness proof of the above linear program
for the objective $\max(\overline{r})$ in $\overline{G}$ can be obtained. This
shows that the value function
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))$ for MDPs with
reward function $r$ can be computed in polynomial time. This gives us the
following result.

Theorem 2. Given an MDP $G$ with a reward function $r$, the value function
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))$ can be computed
in polynomial time.



Algorithm 2 MDPLimInf

Input: MDP $G = ((S, E), (S_1, S_P), \delta)$, a positive reward function $r: S \to \mathbb{R}^+$.
Output: $S_* \subseteq S$ and $f_*: S_* \to \mathbb{R}^+$.

1. Let $r(S) = \{ v_0, v_1, \ldots, v_k \}$ with $v_0 > v_1 > \cdots > v_k$;

2. $G_0 := G$; $S_* := \emptyset$;

3. for $i := 0$ to $k$ do {

   3.1 $U_i := W_1^{G_i}(\mathrm{coBüchi}(\bigcup_{j \le i} r^{-1}(v_j) \cap S^i))$;

   3.2 for all $u \in U_i \cap S_1$
       $f_*(u) := v_i$;

   3.3 $S_* := S_* \cup (U_i \cap S_1)$;

   3.4 $B_i := \mathrm{Attr}_P(U_i, G_i)$;

   3.5 $G_{i+1} := G_i \setminus B_i$, $S^{i+1} := S^i \setminus B_i$;

   }

4. return $S_*$ and $f_*$.
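Algorithm 2 differs from Algorithm 1 only in step 3.1, so an executable sketch mainly needs an almost-sure coBüchi winning set. The version below makes an assumption that is not stated in the paper: it computes the largest subset of the coBüchi set in which player 1 can remain forever, and then takes the states that can reach (and hence revisit) that subset almost surely, using the same simple Büchi fixpoint as a generic textbook procedure. All names and the dictionary encoding are hypothetical.

```python
def attractor(target, states, s1, succ):
    """Attr_P(target) inside the sub-MDP induced by `states`."""
    t = set(target)
    changed = True
    while changed:
        changed = False
        for s in states - t:
            nxt = [u for u in succ[s] if u in states]
            joins = (bool(nxt) and all(u in t for u in nxt)) if s in s1 \
                else any(u in t for u in nxt)
            if joins:
                t.add(s)
                changed = True
    return t


def almost_sure_buchi(b, states, s1, succ):
    """Almost-sure winning set for Buchi(b) in the sub-MDP on `states`."""
    w = set(states)
    while True:
        reach = set(b) & w
        frontier = list(reach)
        while frontier:
            t = frontier.pop()
            for s in w:
                if s not in reach and t in succ[s]:
                    reach.add(s)
                    frontier.append(s)
        bad = w - reach
        if not bad:
            return w
        w -= attractor(bad, w, s1, succ)


def safe_set(c, states, s1, succ):
    """Largest Z inside c where player 1 can stay forever (assumed helper)."""
    z = set(c) & states
    changed = True
    while changed:
        changed = False
        for s in list(z):
            nxt = [t for t in succ[s] if t in states]
            ok = any(t in z for t in nxt) if s in s1 else all(t in z for t in nxt)
            if not ok:
                z.discard(s)
                changed = True
    return z


def mdp_liminf(s1, sp, succ, reward):
    """Return f_* as a dict on S_* (the keys), following Algorithm 2."""
    states = set(s1) | set(sp)
    f_low = {}
    for v in sorted(set(reward.values()), reverse=True):
        if not states:
            break
        c = {s for s in states if reward[s] >= v}      # reward at least v_i
        z = safe_set(c, states, s1, succ)
        u = almost_sure_buchi(z, states, s1, succ)     # step 3.1 (assumed route)
        for s in u & set(s1):                          # steps 3.2, 3.3
            f_low[s] = v
        states -= attractor(u, states, s1, succ)       # steps 3.4, 3.5
    return f_low


# Hypothetical bipartite MDP with two disjoint loops a <-> p and b <-> q.
s1, sp = {"a", "b"}, {"p", "q"}
succ = {"a": ["p"], "p": ["a"], "b": ["q"], "q": ["b"]}
reward = {"a": 1, "p": 4, "b": 2, "q": 3}
f_low = mdp_liminf(s1, sp, succ, reward)
```

On this example each loop alternates between its two rewards forever, so the liminf is the smaller reward of the loop.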



3.3 MDPs with liminf objectives 

In this subsection we present polynomial-time algorithms for MDPs with liminf
objectives, and then present the complexity result for 2½-player games with
limsup and liminf objectives.

Informal description of algorithm. We first present an algorithm that takes
an MDP $G$ with a positive reward function $r: S \to \mathbb{R}^+$, and
computes a set $S_*$ and a function $f_*: S_* \to \mathbb{R}^+$. The output of
the algorithm will be useful in the reduction of MDPs with liminf objectives to
MDPs with max objectives. Let the rewards be $v_0 > v_1 > \cdots > v_k$. The
algorithm proceeds in iterations, and in iteration $i$ we denote the MDP as
$G_i$ and the state space as $S^i$. At iteration $i$ the algorithm considers
the set $V_i$ of states with reward at least $v_i$ in the MDP $G_i$, and
computes the set $U_i = W_1^{G_i}(\mathrm{coBüchi}(V_i))$ (i.e., the
almost-sure winning set in the MDP $G_i$ for the coBüchi objective with the
coBüchi set $V_i$). For all $u \in U_i \cap S_1$ we assign $f_*(u) = v_i$ and
add the set $U_i \cap S_1$ to $S_*$. Then the set $\mathrm{Attr}_P(U_i, G_i)$
is removed from the MDP $G_i$ and we proceed to iteration $i + 1$. In $G_i$ all
end components that contain only rewards at least $v_i$ are contained in $U_i$
(by Lemma 3 part (2)), and all end components in $S^i \setminus U_i$ do not
intersect with $\mathrm{Attr}_P(U_i, G_i)$ (by Lemma 3 part (3)). This gives us
the following lemma.

Lemma 6. Let $G$ be an MDP with a positive reward function
$r: S \to \mathbb{R}^+$. Let $f_*$ be the output of Algorithm 2. For all end
components $U \in \mathcal{E}(G)$ and all states $u \in U \cap S_1$, we have
$\min(r(U)) \le f_*(u)$.

Proof. Let $U_* = \bigcup_{i=0}^{k} U_i$ (as computed in Algorithm 2). Then it
follows from Lemma 3 that for all $A \in \mathcal{E}(G)$ we have
$A \cap U_* \neq \emptyset$. Consider $A \in \mathcal{E}(G)$ and let
$v_i = \min(r(A))$. Suppose for some $j < i$ we have
$A \cap U_j \neq \emptyset$. Then there is a strategy to ensure that $U_j$ is
reached with probability 1 from all states in $A$ and then play an almost-sure
winning strategy in $U_j$ to ensure
$\mathrm{coBüchi}(\bigcup_{l \le j} r^{-1}(v_l) \cap S^j)$. Then
$A \subseteq U_j$. Hence for all $u \in A \cap S_1$ we have
$f_*(u) = v_j > v_i$. If for all $j < i$ we have $A \cap U_j = \emptyset$, then
we show that $A \subseteq U_i$. The uniform memoryless strategy $\sigma_A$ (as
used in Lemma 2) in $G_i$ is a witness to prove that $A \subseteq U_i$. In this
case for all $u \in A \cap S_1$ we have $f_*(u) = v_i = \min(r(A))$. The
desired result follows. ∎



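Transformation to MDPs with max objective. Consider an MDP
$G = ((S, E), (S_1, S_P), \delta)$ with a positive reward function
$r: S \to \mathbb{R}^+$, and let $S_*$ and $f_*$ be the output of Algorithm 2.
We construct an MDP
$\overline{G} = ((\overline{S}, \overline{E}), (\overline{S}_1, S_P), \overline{\delta})$
with a reward function $\overline{r}$ as follows:

- $\overline{S} = S \cup \widehat{S}_*$; i.e., the set of states consists of the
  state space $S$ and a copy $\widehat{S}_*$ of $S_*$.

- $\overline{E} = E \cup \{ (s, \widehat{s}) \mid s \in S_*, \widehat{s} \in \widehat{S}_* \text{ where } \widehat{s} \text{ is the copy of } s \} \cup \{ (\widehat{s}, \widehat{s}) \mid \widehat{s} \in \widehat{S}_* \}$;
  i.e., along with the edges $E$, for all states $s$ in $S_*$ there is an edge
  to its copy $\widehat{s}$ in $\widehat{S}_*$, and all states in
  $\widehat{S}_*$ are absorbing states.

- $\overline{S}_1 = S_1 \cup \widehat{S}_*$.

- $\overline{\delta} = \delta$.

- $\overline{r}(s) = 0$ for all $s \in S$ and $\overline{r}(\widehat{s}) = f_*(s)$
  for $\widehat{s} \in \widehat{S}_*$, where $\widehat{s}$ is the copy of $s$.

We refer to the above construction as liminf conversion. The following lemma
proves the relationship between the value function
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))$ and
$\langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))$.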

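Lemma 7. Let $G$ be an MDP with a positive reward function
$r: S \to \mathbb{R}^+$. Let $\overline{G}$ and $\overline{r}$ be obtained from
$G$ and $r$ by the liminf conversion. For all states $s \in S$, we have

$$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))(s) = \langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s).$$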
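Proof. The result is obtained from the following two-case analysis.

1. Let $\sigma$ be a pure memoryless optimal strategy in $G$ for the objective
   $\mathrm{liminf}(r)$. Let $\mathcal{C} = \{ C_1, C_2, \ldots, C_m \}$ be the
   set of closed connected recurrent sets in the Markov chain obtained from $G$
   after fixing the strategy $\sigma$. Since $G$ is a bipartite MDP, it follows
   that for all $1 \le i \le m$, we have $C_i \cap S_1 \neq \emptyset$. Let
   $C = \bigcup_{i=1}^{m} C_i$. We define a pure memoryless strategy
   $\overline{\sigma}$ in $\overline{G}$ as follows:

   $$\overline{\sigma}(s) = \begin{cases} \sigma(s) & s \in S_1 \setminus C; \\ \widehat{s} & \widehat{s} \in \widehat{S}_* \text{ and } s \in S_1 \cap C. \end{cases}$$

   By Lemma 6 it follows that the strategy $\overline{\sigma}$ ensures that for
   all $C_i \in \mathcal{C}$ and all $s \in C_i$, the maximal reward reached in
   $\overline{G}$ is at least $\min(r(C_i))$ with probability 1. It follows
   that for all $s \in S$ we have

   $$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))(s) \le \langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s).$$

2. Let $\overline{\sigma}$ be a pure memoryless optimal strategy for the
   objective $\max(\overline{r})$ in $\overline{G}$. We fix a strategy $\sigma$
   in $G$ as follows: if at a state $s \in S_*$ the strategy
   $\overline{\sigma}$ chooses the edge $(s, \widehat{s})$, then in $G$ on
   reaching $s$, the strategy $\sigma$ plays an almost-sure winning strategy
   for the objective
   $\mathrm{coBüchi}(\bigcup_{v_j \ge f_*(s)} r^{-1}(v_j))$; otherwise $\sigma$
   follows $\overline{\sigma}$. It follows that for all $s \in S$ we have

   $$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))(s) \ge \langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))(s).$$

Thus we have the desired result. ∎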

Linear programming for the max objective in $\overline{G}$. The linear program
of subsection 3.2 characterizes the value function
$\langle\langle 1 \rangle\rangle_{val}^{\overline{G}}(\max(\overline{r}))$.
This shows that the value function
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))$ for MDPs with
reward function $r$ can be computed in polynomial time. This gives us the
following result.

Theorem 3. Given an MDP $G$ with a reward function $r$, the value function
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))$ can be computed
in polynomial time.

3.4 2½-player games with limsup and liminf objectives

We now show that 2½-player games with limsup and liminf objectives can be
decided in NP ∩ coNP. The pure memoryless optimal strategies (whose existence
follows from Theorem 1) provide the polynomial witnesses, and to obtain the
desired result we need a polynomial-time verification procedure. In other
words, we need polynomial-time algorithms for MDPs with limsup and liminf
objectives. Since the value functions in MDPs with limsup and liminf objectives
can be computed in polynomial time (Theorem 2 and Theorem 3), we obtain the
following result about the complexity of 2½-player games with limsup and
liminf objectives.

Theorem 4. Given a 2½-player game graph $G$ with a reward function $r$, a
state $s$ and a rational value $q$, the following assertions hold: (a) whether
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{limsup}(r))(s) \ge q$ can be
decided in NP ∩ coNP; and (b) whether
$\langle\langle 1 \rangle\rangle_{val}^{G}(\mathrm{liminf}(r))(s) \ge q$ can be
decided in NP ∩ coNP.